Introduction


This brief tutorial will show how to use Pyntacle from the command-line or its API interface to find relevant groups of nodes (also called here the kp-set) in a network using the fragmentation and reachability key-player metrics.

Fragmentation is the structural effect of removing nodes from a network. Reachability is the property of nodes to reach their direct or indirect neighbors in a network with a few connections. Although we will provide here a brief explanation of these concepts, we recommend reading the introduction to group centrality section.

All data used in this tutorial are available here for download.

Setting Pyntacle for the first use


After installing Pyntacle as described here and activating the Pyntacle's conda environment by typing

conda activate pyntacle_env

Note: pyntacle_env is just a naming convention

You can then check the Pyntacle version by typing:

pyntacle --version

Installation is correct if the following command ends succesfully:

In [1]:
import pyntacle

Otherwise, contact us and send us a description of the error, along with all the necessary information to reproduce the error.

Dataset description


The toy dataset used here is the network of advice-seeking ties among global consulting companies (also called as the AdvSeek network in this tutorial). This network was already used in the original paper by Stephen P. Borgatti, in which the key-player metrics are introduced, Identifying sets of key players in a social network.

fig8

We made this network available here as an adjacency matrix (cf. the supported file formats). Nodes in the network are labeled with the initials of the consultants. Since some of the consultants share the same initials, we appended progressive numbers to their initials (e.g. BS, BS2, BS3, etc) to be compliant with the Pyntacle Minimum Requirements. Edges represent relationships among consultants.

Command-line guide


Testing the Pyntacle command-line

To check that Pyntacle is properly installed and working, a set of unit tests can be run by typing:

pyntacle test

Several tests will be launched. If succesfull, this command will result in the following message:

Ran 27 tests in 21.002s

OK

Otherwise, please contact us specifying your operative system, its version, how you installed Pyntacle and the output of your tests in a text file. You can redirect the output of the tests this way:

pyntacle test >> testlog.txt 2>&1

Once verified the correct installation of Pyntacle, you can replicate some of the results of Borgatti’s original article in two steps.

kp-info: Computing the key-player metrics for a given set of nodes

First, the Pyntacle kp-info function will be used to measure the dF fragmentation index for a specific set of nodes of the network. Borgatti measured dF for the pair {HB, WD} and obtained a value of 0.817. We recall here that dF ranges from 0 to 1, where 0 means maximum connectedness (i.e., the network is a clique) and 1 means maximum disconnection (all nodes are isolates). This result can be replicated by typing:

pyntacle keyplayer kp-info -i AdvSeek.adjm -t dF --nodes HB,WD

This command returns the following output (we report the final summary only for the sake of readability):

****************************************** RUN SUMMARY *******************************************

Removing node set

(HB, WD)

gives a dF value of 0.81716

Starting graph dF was 0.64939

****************************************************************************************************

The resulting fragmentation can be better assessed comparing the dF of the original network (before node removal) with that calculated after the node removal. For this reason, kp-info returns also the dF value of the original network.

Results of kp-info can be saved in several file formats (default is tab-separated).

Note: you can specify a custom target directory using the --directory/-d argument

The representation of the AdvSeek network follows, with the removed nodes represented in purple. (Thicker edges around the kp-set highlight the {HB, WD} neighborhood)

kpinfo

kp-finder: Greedy-optimization search

The set of nodes of a particular size (k) that maximally fragment a network or that reach the highest number of other nodes can be obtained using the greedy-optimization search algorithm. It will search for a suboptimal set of nodes of size k for a selected key-player metrics. Starting from a random set of nodes and then swapping nodes with others outside the set until the selected key-player metrics cannot be improved, one dramatically reduces the number of operations at the cost of suboptimal solutions. This algorithm is particularly useful when dealing with large networks.

The key-player metric chosen in this tutorial is the m-reach, which is a measure of reachability. It aims at counting the number of nodes reached by a vertex set of size k in m steps or less. In Borgatti’s paper, the m-reach metric is calculated on the AdvSeek network for two different k sets and setting m to 1.

k Nodes reached % of network reached kp-set
2 15 53 {BM, BS3}
3 20 72 {BM, BS3, NP}

Note: The node BS is here called BS3 as there are 3 synonymous nodes in the original graph

To reproduce this result, type:

pyntacle keyplayer kp-finder -i AdvSeek.adjm -k 2 -m 1 -t mreach

kp-finder enables the use of the greedy-optimization search algorithm by default. This behaviour can be changed using the -I/--implementation argument and setting the brute-force (See the brute-force section) or sgd (stochastic-gradient descent) options.

Pyntacle will output (only the run summary is reported):

****************************************** RUN SUMMARY *******************************************

Size of the node set: 2

Key-player set of size 2 for m-reach, using at most 1 step, is:

(BS3, BM)

with value 15 on 32 (number of nodes reached on the total number of nodes)

The total percentage of nodes, which includes the kp-set, is 53.12%

****************************************************************************************************

Note: We report the number of nodes reached by the kp-set (m-reach) and the percentage of reached nodes, which includes the kp-set (as from Borgatti’s paper).

Results are stored in a txt file in the current working directory.

Similarly, the optimal set of size 3 can be sought typing:

pyntacle keyplayer kp-finder -i AdvSeek.adjm -k 3 -m 1 -t mreach

This will produce the following output:

****************************************** RUN SUMMARY *******************************************

Size of the node set: 3

Key-player set of size 3 for m-reach, using at most 1 step, is:

(KR, BM, NP)

with value 20 on 32 (number of nodes reached on the total number of nodes)

The total percentage of nodes, which includes the kp-set, is 71.88%

****************************************************************************************************

The {KR, BM, NP} kp-set is not the one reported by Borgatti {BM, BS3, NP}, despite their scores being identical. This means that this network holds more kp-sets of size 3 with equal fragmentation scores.

A way of getting all these sets would be to run the greedy-optimization search algorithm several times. But this will not guarantee to capture all existing kp-sets. Another, exact, way would be to switch to the Brute-force search algorithm, as we will discuss in the next section.

kp-finder: Brute-force search

Contrary to the greedy-optimization search algorithm, the brute-force search algorithm seeks and finds the best solutions at the price of high demand of computing resources and running times. This task requires eq1_bis operations, where N is the size of the graph.

However, it was implemented to calculate the desired metrics for all the combinations of size k of nodes, in parallel on multiple processors, if available. It can be enabled by setting the -I/--implementation parameter with the brute-force keyword. By default, Pyntacle will use 1 CPU only. However, this can be tuned by the -O/--nprocs parameter.

Let's consider again the AdvSeek network and the case discussed in the previous chapter. The greedy-optimization search did not replicate the Borgatti's findings on the AdvSeek network. Searching again the best kp-sets of size 3 with the brute-force search algorithm:

pyntacle keyplayer kp-finder -i AdvSeek.adjm -k 3 -m 1 -t mreach -I brute-force

****************************************** RUN SUMMARY *******************************************

Size of the node set: 3

Key player sets of size 3 for m-reach, using at most 1 step, are:

(KR, BM, NP)

(BS3, BM, NP)

with value 20 on 32 (number of nodes reached on the total number of nodes)

The total percentage of nodes, which include the kp-set, is 71.88%

****************************************************************************************************

The two best solutions were finally found. This concludes the quick start guide of Pyntacle via command-line. The same problems will be tackled using the Pyntacle's API interface in two ways: using the Octopus wrapper and by directly using the Pyntacle methods for kp-search.

Pyntacle library guide


The APIs are designed for intermediate-to-expert Python users, with a basic knowledge of object-oriented programming and some experience with the igraph package. If you are not acquainted with igraph basics, we recommend reading its official Python tutorial.

Pyntacle is built around igraph and performs its calculations on instances of igraph.Graph objects. Pyntacle provides several tools for importing/exporting from/to igraph.Graph objects and to/from several textual network file formats. We refer to the File Formats Specifications page for more details on this regard. Moreover, we recommend reading the Minimum Network Requirements.

Importing a network using Pyntacle

Networks can be imported from file via the PyntacleImporter class of the io_stream module. This class contains a series of handy methods to parse and store an input graph into an igraph.Graph object and initialize all the Pyntacle reserved attributes (see the Minimum Network Requirements section).

Considering again the AdvSeek network, which is available as an adjacency matrix (AdvSeek.adjm), it is imported with the following command:

In [2]:
from pyntacle.io_stream.importer import PyntacleImporter
adv = PyntacleImporter.AdjacencyMatrix("AdvSeek.adjm")
Adjacency matrix from AdvSeek.adjm imported

The imported adv object is of type igraph

In [3]:
type(adv)
Out[3]:
igraph.Graph

The adv object can thus be inspected by means of standard igraph methods:

In [4]:
adv.summary()
Out[4]:
"IGRAPH UN-- 32 55 -- ['AdvSeek']\n+ attr: isolates (g), name (g), sif_interaction_name (g), name (v), parent (v), adjacent_nodes (e), sif_interaction (e)"

Octopus: A convenient wrapper of Pyntacle methods

Octopus' methods are included in the Octopus class of the tools module. This is the list of all available Octopus methods:

In [5]:
from tools.octopus import Octopus
dir(Octopus)
Out[5]:
['BF_F',
 'BF_dF',
 'BF_dR',
 'BF_group_betweenness',
 'BF_group_closeness',
 'BF_group_degree',
 'BF_mreach',
 'F',
 'GO_F',
 'GO_dF',
 'GO_dR',
 'GO_group_betweeness',
 'GO_group_closeness',
 'GO_group_degree',
 'GO_mreach',
 'SGD_F',
 'SGD_dF',
 'SGD_dR',
 'SGD_group_betweeness',
 'SGD_group_closeness',
 'SGD_group_degree',
 'SGD_mreach',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 'average_closeness',
 'average_clustering_coefficient',
 'average_degree',
 'average_eccentricity',
 'average_global_shortest_path_length',
 'average_radiality',
 'average_radiality_reach',
 'average_shortest_path_length',
 'betweenness',
 'closeness',
 'clustering_coefficient',
 'compactness',
 'completeness',
 'completeness_naive',
 'components',
 'dF',
 'degree',
 'density',
 'diameter',
 'eccentricity',
 'eigenvector',
 'group_betweenness',
 'group_closeness',
 'group_degree',
 'kp_F',
 'kp_dF',
 'kp_dR',
 'kp_mreach',
 'median_global_shortest_path_length',
 'median_shortest_path_length',
 'pagerank',
 'pi',
 'radiality',
 'radiality_reach',
 'radius',
 'shortest_paths',
 'weighted_clustering_coefficient']

When calling an Octopus function, Octopus will execute the corresponding Pyntacle function, will assign a new attribute to the target graph and will return the result of the called function. For example, let us suppose one wants to compute the average degree of the AdvSeek network.

In [6]:
Octopus.average_degree(adv)
Calculating the average degree and storing in the graph 'average_degree' attribute.

This function will trigger the computation of the average degree, will create an average_degree attribute for the graph adv and will set the result (3.4375 for the AdvSeek Network) to it:

In [7]:
adv.attributes()
Out[7]:
['name', 'sif_interaction_name', 'isolates', 'average_degree']
In [8]:
adv["average_degree"]
Out[8]:
3.4375

Octopus: Compute key-player metrics for a given set of nodes

With the aim to reproduce Borgatti's results regarding the network fragmentation on the AdvSeek network, as already discussed in the Command-line section, we will use Octopus.

In [9]:
Octopus.kp_dF(adv, nodes=["HB", "WD"])
Removing node set HB,WD from the graph, measuring fragmentation using the dF index and storing in the graph 'dF_info' attribute.
Computing dF for nodes (HB, WD)
Elapsed Time: 0.00 sec
Initializing 'dF_info' attribute.

where nodes is a list of names of the nodes belonging to the kp-set.

Octopus will add a new attribute, dF_info, to the graph adv. This attribute is actually a dictionary, where each item is a key-value pair made by a (sorted) tuple of nodes (the kp-set) and the corresponding dF value.

In [10]:
adv.attributes()
Out[10]:
['name', 'sif_interaction_name', 'isolates', 'average_degree', 'dF_info']

and

In [11]:
adv["dF_info"]
Out[11]:
{('HB', 'WD'): 0.81716}

dF is a relative metrics, in the sense that the effect of removing a set can be appreciated if compared with the fragmentation level of the original network. The initial dF value of a network can be computed by using the dF method:

In [12]:
Octopus.dF(adv)
Calculating the Fragmentation status of the graph using the dF (distance-based fragmentation) index and storing in the graph 'F' attribute.

The initial dF is stored in the corresponding dF attribute

In [13]:
adv["dF"]
Out[13]:
0.64939

Now, we can conclude that removing HB and WD will result in a fragmentation increase of 17%.

Octopus: Greedy-optimization search

The previous greedy-optimization search method, which was performed by the kp-finder command line command, can be replicated calling the GO_mreach method.

In [14]:
Octopus.GO_mreach(adv,k=2, m=1)
Finding an optimal node set of size 2 for reachability index m-reach with a distance of 1 using a greedily-optimized search and storing the set and its value in the 'mreach_1_greedy' attribute.
Optimal group: {BM, BS3}
 Group size = 2
 Metric = mreach
 Score = 15
Elapsed Time: 0.01 sec
Initializing 'mreach_1_greedy' attribute.

Like in the previous section, Octopus adds an attribute to the graph that stores a dictionary of node names (kp-set) and centrality measures. In this example, the attribute will be called mreach_1_greedy, since the centrality metrics is the m-reach with m=1 argument and obtained with a run of the greedy-optimization search algorithm.

In [15]:
adv.attributes()
Out[15]:
['name',
 'sif_interaction_name',
 'isolates',
 'average_degree',
 'dF_info',
 'dF',
 'mreach_1_greedy']

The name of the attribute says that the m-reach metric with m = 1 was computed with the greedy-optimization algorithm. This will ensure that all the following greedy-optimization searches for m-reach of distance 1 will be stored in this variable, keeping tracks of all the runs.

As before, we can obtain the key-values:

In [16]:
adv["mreach_1_greedy"]
Out[16]:
{('BM', 'BS3'): 15}

Octopus: Brute-force search

Finally, the brute-force search method performed using the command-line can be replicated with Octopus using $5$ CPU cores.

In [17]:
Octopus.BF_mreach(adv, k=3, m=1, nprocs=5)
Finding the best node set(s) of size 3 for reachability index m-reach with a maximum distance of 1 using a brute-force search and storing the set(s) and its value in the 'mreach_1_bruteforce' attribute.
Evaluating 4960 possible solutions
Brute-force search of the best kp-set of size 3 using 5 processes
Best groups: {[BM, BS3, NP], [BM, KR, NP]}
 Groups size = 3
 Metric = mreach, m=1
 Score = 20
Elapsed Time: 7.03 sec
Initializing 'mreach_1_bruteforce' attribute.

Again, a new attribute storing the results is added to the adv graph. This attribute is stored with the name mreach_1_bruteforce. It will contain a tuple of $3$ elements storing the node names as key and the corresponding m-reach values.

In [18]:
adv["mreach_1_bruteforce"]
Out[18]:
{(('BM', 'BS3', 'NP'), ('BM', 'KR', 'NP')): 20}

Exporting the igraph.Graph object

Any igraph.Graph object can be saved to text or binary files (we refer again to the File Formats Specifications page). Additionally, graphs can be exported with all their attributes. The export utilities are implemented in the PyntacleExporter class of the io_stream module. Let us export the adv network to a binary file.

In [19]:
from pyntacle.io_stream.exporter import PyntacleExporter
PyntacleExporter.Binary(adv, "AdvSeek.graph")
Graph successfully exported to binary at full path:
C:\Users\t.mazza\Desktop\CSS-Bioinformatics\pyntacle\pyntacle.css-mendel.it\resources\tutorials\quickstart_guide\AdvSeek.graph

This will create a file named advseek.graph in the current directory.

Key-player search without Octopus: Brute-force search

The same functions performed with Octopus can also be performed using the Pyntacle's APIs (algorithms and tools modules). These methods rely on an array of enumerators by which specifying:

  • the key-player metrics to be calculated;
  • the computing modes for some of the Pyntacle’s methods (i.e., serial, parallel CPU, GPU);

These enumerators are implemented in the tools module and can be imported as:

In [20]:
from tools.enums import KpposEnum #this enumerator stores all the reachability metrics
from tools.enums import CmodeEnum #this one handles all the computing modes

KpposEnum currently includes two reachability metrics: m-reach and dR.

In [21]:
dir(KpposEnum)
Out[21]:
['__class__', '__doc__', '__members__', '__module__', 'dR', 'mreach']

CmodeEnum contains four values: auto, igraph, cpu and gpu; auto is the default choice and commands Pyntacle to choose the best computing mode, according to the specific features of the graph is working on. The igraph option is chosen to run the serial versions of the Pyntacle algorithms; cpu is used to enable fine-grained, multi-threading, parallelism on the enumeration of shortest-paths by relying on just-in-time compilation of Pyntacle's methods by Numba. Finally, gpu is used to defer computationally heavy functions to be executed on CUDA-enabled NVIDIA graphics cards, if available. The value of this enumerator can be passed to the implementation parameter of any key-player methods.

In [22]:
dir(CmodeEnum)
Out[22]:
['__class__',
 '__doc__',
 '__members__',
 '__module__',
 'auto',
 'cpu',
 'gpu',
 'igraph']

For example, the brute-force search algorithm executed with Octopus can be replicated using the BruteForceSearch method:

In [23]:
from algorithms.bruteforce_search import BruteforceSearch  as bfs
In [24]:
bfs.reachability(graph=adv, cmode=CmodeEnum.igraph, metric=KpposEnum.mreach, m=1, k=3)
Evaluating 4960 possible solutions
Brute-force search of the best kp-set of size 3
Best groups: {[BM, BS3, NP], [BM, KR, NP]}
 Groups size = 3
 Metric = mreach, m=1
 Score = 20
Out[24]:
([['BM', 'BS3', 'NP'], ['BM', 'KR', 'NP']], 20)

Please contact us to leave a feedback.